
3.5.5 Forward Propagation Based on Projection Convolution Layer

For each full-precision kernel $C_i^l$, the corresponding quantized kernels $\hat{C}_{i,j}^l$ are concatenated to construct the kernel $D_i^l$ that actually participates in the convolution operation as

$$D_i^l = \hat{C}_{i,1}^l \oplus \hat{C}_{i,2}^l \oplus \cdots \oplus \hat{C}_{i,J}^l, \qquad (3.45)$$

where $\oplus$ denotes the concatenation operation on the tensors. In PCNNs, the projection convolution is implemented based on $D^l$ and $F^l$ to calculate the next layer's feature map $F^{l+1}$:

$$F^{l+1} = \mathrm{Conv2D}(F^l, D^l), \qquad (3.46)$$

where Conv2D is the traditional 2D convolution. Although our convolutional kernels are 3D-shaped tensors, we design the following strategy to fit the traditional 2D convolution:

$$F_{h,j}^{l+1} = \sum_{i,h} F_h^l \otimes D_{i,j}^l, \qquad (3.47)$$

$$F_h^{l+1} = F_{h,1}^{l+1} \oplus \cdots \oplus F_{h,J}^{l+1}, \qquad (3.48)$$

where $\otimes$ denotes the convolutional operation. $F_{h,j}^{l+1}$ is the $j$th channel of the $h$th feature map at the $(l+1)$th convolutional layer, and $F_h^l$ denotes the $h$th feature map at the $l$th convolutional layer. To be more precise, for example, when $h = 1$, the $j$th channel of an output feature map, $F_{1,j}^{l+1}$, is the sum of the convolutions between all the $h$ input feature maps and the $i$ corresponding quantized kernels. All channels of the output feature map are obtained as $F_{h,1}^{l+1}, \ldots, F_{h,j}^{l+1}, \ldots, F_{h,J}^{l+1}$, and they are concatenated to construct the $h$th output feature map $F_h^{l+1}$.
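To make the layout of the projection convolution concrete, the following PyTorch-style sketch concatenates the $J$ quantized kernels along the output-channel axis and applies one standard 2D convolution, so the output channels already sit grouped as in Eqs. 3.47–3.48. The tensor shapes and the `quantize_project` helper are illustrative assumptions standing in for the DBPP projection, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def quantize_project(C, W_j):
    # Placeholder for the j-th projection P_Omega(W_j, C): scale by the
    # projection matrix, then snap to the discrete set {-1, +1} (assumption).
    return torch.sign(W_j * C)

def projection_conv2d(feature_map, kernels, proj_matrices, stride=1, padding=1):
    """
    feature_map   : (N, C_in, H, W)      -- F^l
    kernels       : (C_out, C_in, k, k)  -- full-precision kernels C^l
    proj_matrices : list of J tensors broadcastable to `kernels` -- W^l_j
    """
    # Eq. 3.45: concatenate the J quantized kernels to form D^l.
    D = torch.cat([quantize_project(kernels, W_j) for W_j in proj_matrices], dim=0)
    # Eq. 3.46: a single traditional 2D convolution with the concatenated kernel.
    out = F.conv2d(feature_map, D, stride=stride, padding=padding)
    # Eqs. 3.47-3.48: the J channel groups F^{l+1}_{h,j} lie back-to-back
    # along the channel axis, i.e. they are already concatenated.
    return out

# Usage sketch: J = 2 projections, 64 input and 64 output channels.
x = torch.randn(1, 64, 32, 32)
C = torch.randn(64, 64, 3, 3)
W = [torch.full_like(C, 0.9), torch.full_like(C, 1.1)]
y = projection_conv2d(x, C, W)   # (1, 128, 32, 32): 64 channels per projection
```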

It should be emphasized that we can utilize multiple projections to increase the diversity of the convolutional kernels $D^l$. However, even a single projection performs much better than existing BNNs. The essential ingredient is the use of DBPP, which differs from [147], where a single quantization scheme is used. Within our convolutional scheme, there is no dimension mismatch between the feature maps and kernels of two successive layers. Thus, we can replace the traditional convolutional layers with ours to binarize widely used networks, such as VGGs and ResNets. At inference time, we only store the set of quantized kernels $D_i^l$ instead of the full-precision ones; that is, the projection matrices $W_j^l$ are not used for inference, achieving a reduction in storage.
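As a rough illustration of the storage argument (an assumption for exposition, not the released deployment code, and assuming the discrete set is $\{-1, +1\}$), the quantized entries of $D_i^l$ can be bit-packed while $C_i^l$ and $W_j^l$ are simply discarded:

```python
import numpy as np

# Quantized kernel D^l with entries in {-1, +1} (illustrative shape).
D = np.sign(np.random.randn(128, 64, 3, 3)).astype(np.float32)
packed = np.packbits((D > 0).astype(np.uint8))   # 1 bit per weight

full_precision_bytes = D.size * 4   # cost of keeping the float32 kernels
packed_bytes = packed.size          # roughly D.size / 8
print(full_precision_bytes / packed_bytes)       # ~32x smaller
```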

3.5.6 Backward Propagation

According to Eq. 3.44, what should be learned and updated are the full-precision kernels $C_i^l$ and the projection matrix $W^l$, using the update equations described below.

Updating $C_i^l$: We define $\delta_{C_i}$ as the gradient of the full-precision kernel $C_i$, and have

$$\delta_{C_i^l} = \frac{\partial L}{\partial C_i^l} = \frac{\partial L_S}{\partial C_i^l} + \frac{\partial L_P}{\partial C_i^l}, \qquad (3.49)$$

$$C_i^l \leftarrow C_i^l - \eta_1 \delta_{C_i^l}, \qquad (3.50)$$

where $\eta_1$ is the learning rate for the convolutional kernels. More specifically, for each item in Eq. 3.49, we have

$$\frac{\partial L_S}{\partial C_i^l} = \sum_j^J \frac{\partial L_S}{\partial \hat{C}_{i,j}^l} \cdot \frac{\partial P_{\Omega_N}^{l,j}(W_j^l, C_i^l)}{\partial (W_j^l \odot C_i^l)} \cdot \frac{\partial (W_j^l \odot C_i^l)}{\partial C_i^l} = \sum_j^J \frac{\partial L_S}{\partial \hat{C}_{i,j}^l} \odot \mathbf{1}_{-1 \le W_j^l \odot C_i^l \le 1} \odot W_j^l, \qquad (3.51)$$
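A minimal sketch of the gradient in Eq. 3.51, assuming an elementwise projection onto $\{-1, +1\}$ and the clipped straight-through behavior expressed by the indicator function; the names, shapes, and the explicit loop over projections are illustrative assumptions rather than the reference code.

```python
import torch

def grad_LS_wrt_C(grad_C_hat, C, W_list):
    """Eq. 3.51 for one kernel C^l_i.
    grad_C_hat : list of J tensors, dL_S / dC^hat_{i,j} from the convolution
    C          : full-precision kernel C^l_i
    W_list     : list of J projection matrices W^l_j (broadcastable to C)
    """
    grad_C = torch.zeros_like(C)
    for g_j, W_j in zip(grad_C_hat, W_list):
        scaled = W_j * C                              # W^l_j ⊙ C^l_i
        indicator = (scaled.abs() <= 1).to(C.dtype)   # 1_{-1 <= W⊙C <= 1}
        grad_C += g_j * indicator * W_j               # summand of Eq. 3.51
    return grad_C

# Eq. 3.50: update the full-precision kernel with learning rate eta_1.
# (The dL_P/dC^l_i term from Eq. 3.49 would be added to the gradient as well.)
C = torch.randn(64, 3, 3)
W_list = [torch.full_like(C, 0.9), torch.full_like(C, 1.1)]
grad_C_hat = [torch.randn_like(C), torch.randn_like(C)]
eta_1 = 0.01
C = C - eta_1 * grad_LS_wrt_C(grad_C_hat, C, W_list)
```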